Figure 1: Decision Tree
This function accepts column vector as an argument. Then it calculates the 1st quartile, 3rd quartiles and Inter Quartile Range which is used to compute and return outliers in the column vector.
#Function to calculate IQR and permissible outliers
getoutandquant <- function(x) {
q1<-quantile(x)[[2]]
q3<-quantile(x)[[4]]
IQR<-q3-q1
out1<-q3+(1.5)*IQR
out2<-q1-(1.5)*IQR
#Finding the list of points which are outliers ]
out<-x[x>out1]
out2<-x[x<out2]
outliers<-tibble(x=c(out,out2),y=0)
return(outliers)
}
The function accepts dataframe,the columns variable name, outlier for the column vector and x-axis label. It then create density plot for the quantitative variable and returns the plot.
#Function to create Density plot of qunatitative variables and marking the outliers.
plotmodel <- function(df2,exp,z,xnames) {
p1<-ggplot(df2,aes_string(exp)) + stat_density(geom="line") + xlab(xnames)+ylab("Density \n")+ geom_point(data=z,aes(x,y),shape=23)
return(p1)
}
To generate density plot of Infection_Risk column, two functions - getoutandquant and plotmodel are used which respectively return the outliers and density plot of the Infection_Risk column.
Analyzing the density plot suggest that more people number acquire infection with probability between 4% and 5%. Furthermore, there are less number people who has probability of getting infection with value more or less than 4-5. Although more data would be helpful in narrowing why the peak is between 4-5 but intuitvely it can be concluded that naturally in a system there would be less number of people with extremely high and low infection whereas the common infection would be more among people.
Converted the graph from ggplot2 to plotly using ggplotly function. The benefit of using ggplotly is that we can hover around the graph and see the values. The amount of information is more available than using ggplot2.
Using scatterplot to find relation between number of nurses and infection risk based on number of beds informs that irrespective of the average number of beds or nurses, most of the patient will acquire infection from hospitals with probability of 0.04% to 0.06% . Additonally the average number of beds do increase with number of nurses which might possible suggest that, that particular hospitals corresponding to more nurses deals with more patients. The possible danger with using this scatterplot with color scale is difficult to make distinction and the clear boundaries are not set.
Without using ggplot2, a histogram of Infection_risk column is generated with plotly, pipeline and various arguments. From the histogram , it can be seen that most number of people tends to get infection with probability around 4% to 5%. Furthermore, the diamond shape points represent the outliers in the histogram.
Dynamic plot is generated for all graph using shiny app. On checking the optimal bandwidth, it is seen that the information gets scrambled when the bandwidth is too low whereas there is a loss of information if the bandwdith becomes too large. The optimal bandwidth to see all graph is around 2.1 .
library(shiny)
library(ggplot2)
library(tidyverse)
df<- read.table("SENIC.txt",colClasses = c("NULL",
rep(NA,6),"NULL","NULL",rep(NA,3)))
#Giving names to column
colnames(df)<-c("Length_Stay","Age","Infection_Risk","Culture","Chest_X_ray",
"No_Beds","Census","Nurses_num","Facility")
ui<-fluidPage(
titlePanel("Density Plots of Quantitative Variables"),
sidebarLayout( sidebarPanel( sliderInput("bw","Slide to change bandwidth
of Plot",min=0.1,max=10,value=3,step=0.2,animate=TRUE),
checkboxGroupInput("variableinp","Choose variables",
choices=colnames(df),selected = colnames(df)[1]),verbatimTextOutput("value")
),
mainPanel( plotOutput("densityplot"))
)
)
server<-function(input,output){
# observeEvent(input$variableinp, {
# print((input$variableinp))
# })
output$densityplot <- renderPlot({
if(!is.null(input$variableinp)) {
getoutandquant <- function(x) {
q1<-quantile(x)[[2]]
q3<-quantile(x)[[4]]
IQR<-q3-q1
out1<-q3+(1.5)*IQR
out2<-q1-(1.5)*IQR
#Finding the list of points which are outliers for a particular variable.
out<-x[x>out1]
out2<-x[x<out2]
outliers<-tibble(x=c(out,out2),y=0)
return(outliers)
}
nplot<-length(input$variableinp)
x<-input$variableinp
p<-list()
for ( i in 1:nplot) {
outlier<-getoutandquant(df[,x[i]])
p[[i]]<-ggplot(df,aes_string(x[i]))+
stat_density(geom="line",adjust=input$bw)+ ylab("Density\n")+
geom_point(data=outlier,aes(x,y),shape=23)
}
do.call(grid.arrange,p)
}
})
}
shinyApp(ui=ui,server=server)
knitr::include_graphics("treevis.png")
library(gridExtra)
library(tidyverse)
library(plotly)
library(grid)
#Read table
df<- read.table("A:/LiU/Visualization/Lab1/SENIC.txt",colClasses = c("NULL",
rep(NA,6),"NULL","NULL",rep(NA,3)))
#Giving names to column
colnames(df)<-c("Length_Stay","Age","Infection_Risk","Culture","Chest_X_ray","No_Beds","Census","Nurses_num","Facility")
infectoutlier <- getoutandquant(df$Infection_Risk)
plotinfec <-plotmodel(df,"Infection_Risk",infectoutlier,"Infection Risk")
plotinfec
#Since we have evaluated the Infection_risk column we can remove it from the dataframe.
df2 <- df[, colnames(df)!="Infection_Risk"]
le<-length(df2)
#Getting the set of outliers for all quantitative variables
z<-apply(df2,2,getoutandquant)
#Giving names so that a good xlabels values can be shown.
xnames<-c("Length of Stays(Days)","Age(Yrs)","Routine Culture Ratio","Routine Chest_X_ray Ratio","Avg Number of Beds","Average Daily Census","Avg Number of Nurses","Avg Facilities")
#Making a plot of empty list.
myplot <- list()
for (i in 1:le) {
myplot[[i]]<-plotmodel(df2,colnames(df2)[i],z[[i]],xnames[i])
}
title=textGrob("Density Plot of SENIC Datasets",gp=gpar(fontface="bold"))
grid.arrange(grobs=c(list(plotinfec),myplot),top=title)
sp<-ggplot(df,aes(Nurses_num,Infection_Risk,color=No_Beds)) +geom_point() + xlab("\n Number of Nurses") + ylab("Infection risk\n")
sp+scale_color_gradientn(colours = rainbow(5))
x<-as.list(infectoutlier[,1])
plot_ly(x= ~df$Infection_Risk,type="histogram",name="Histogram",marker=list(line=list(color="black",width=1)))%>% add_markers(x=x[[1]],y=0,name="Outliers",marker=list(symbol="diamond",size=7))%>% layout(title="Histogram of Infection Risk",xaxis=list(title="Infection Risk"),yaxis=list(title="Frequency"))